Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Authors
Rice University
Abstract
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window. It is commonly recognized that model weights are memory-hungry; however, the size of the key-value embeddings stored during the generation process (the KV cache) can easily surpass the model size. The enormous size of the KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. [Note: only the relatively important tokens significantly affect subsequent generation.] Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model [Note: fixed KV cache budget]. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression.
One-sentence summary
Retain only the important KV cache entries.
Motivation
Repetitive Attention Pattern
Attention scores were sampled at three different sequence positions; the positions that receive high attention scores are largely the same across steps.
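This repetition can be quantified by measuring how much the top-k attended positions overlap between two decoding steps. A minimal sketch (the function name and the synthetic score vectors are illustrative, not from the paper):

```python
import numpy as np

def topk_overlap(scores_a, scores_b, k):
    """Fraction of the top-k attended positions shared by two decoding steps.

    scores_a, scores_b: 1-D attention-score arrays over the same prefix.
    Returns a value in [0, 1]; values near 1 indicate a repetitive pattern.
    """
    top_a = set(np.argsort(scores_a)[-k:])  # indices of the k largest scores
    top_b = set(np.argsort(scores_b)[-k:])
    return len(top_a & top_b) / k

# Toy example: both steps concentrate attention on positions 0 and 2.
step1 = np.array([0.50, 0.10, 0.30, 0.05, 0.05])
step2 = np.array([0.40, 0.05, 0.40, 0.10, 0.05])
print(topk_overlap(step1, step2, k=2))  # 1.0
```

Averaging this overlap over many step pairs is one way to verify the pattern the figure illustrates.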
Persistence of Importance Hypothesis
Only tokens that had a substantial influence at a previous step will play an important role in the next generation step.
Equivalently: for a newly generated token, the tokens that receive high scores in its attention computation should also have received high scores in the attention computations of earlier generated tokens.
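Under this hypothesis, a fixed-budget KV cache can be kept by evicting tokens with low accumulated attention scores. The paper describes a probabilistic scheme (pivotal tokens are kept with higher probability); the sketch below uses a simpler deterministic greedy variant, and all names (`evict_to_budget`, `keep_recent`) are illustrative assumptions:

```python
import numpy as np

def evict_to_budget(keys, vals, importance, budget, keep_recent=4):
    """Shrink a KV cache to `budget` entries.

    importance[i]: accumulated attention score token i has received so far.
    The most recent `keep_recent` tokens are always kept; the remaining
    slots go to the older tokens with the highest accumulated scores
    (the "pivotal" tokens under the persistence-of-importance hypothesis).
    """
    n = len(importance)
    if n <= budget:
        return keys, vals, importance
    recent = list(range(n - keep_recent, n))
    # Older tokens ranked by accumulated score, descending.
    older = np.argsort(importance[: n - keep_recent])[::-1][: budget - keep_recent]
    keep = sorted(older.tolist() + recent)
    return keys[keep], vals[keep], importance[keep]

# Toy cache of 8 tokens; tokens 0, 2, 4 are "pivotal", 6 and 7 are recent.
keys = np.arange(8).reshape(8, 1)   # stand-in for 8 key vectors
vals = np.arange(8).reshape(8, 1)
imp = np.array([5.0, 0.1, 3.0, 0.2, 4.0, 0.3, 0.1, 0.2])
k2, v2, i2 = evict_to_budget(keys, vals, imp, budget=5, keep_recent=2)
print(k2.flatten().tolist())  # [0, 2, 4, 6, 7]
```

In a real decoder, `importance` would be updated from the attention weights at every step, and eviction would run whenever the cache exceeds the budget, keeping memory constant regardless of sequence length.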